With this worksheet, you will learn the first steps with Jupyter, Python, pandas and matplotlib using a practical example: We execute an analysis of the futuristic data set "curse log" – a log of time-tracked swearwords events gathered with the microchip implant "FUTURE 2000" from several humans.
We managed to get the log of uttered curses of some users. We also have some user profile data that we can combine with the curse log to find out, who the cursers are.
As a starting point, we have a log from several users recorded in a file that lists the time stamp, the curse word and the id of the user's profile each curse:
timestamp,curse,profile_id
2132-12-31 14:47:43,The A-word,0
2132-12-31 13:13:56,The F-word,0
2132-12-31 13:03:05,The S-word,0
2132-12-31 12:30:34,The F-word,0
2132-12-31 12:29:02,The S-word,0
Hint: This data is a data set from the future. But since time machines aren't invented yet, we are working with a generated / synthetic dataset based on a real data set from an other domain.
Let's get to know the tools we use!
In [1]:
"Hello World"
Out[1]:
ESC key.b key. m.Enter (note the color to the left of the cell, which turns green instead of blue).Ctrl + Enter.This is a text
In [2]:
text = "Hello World"
text
Out[2]:
In [3]:
text[0]
Out[3]:
In [4]:
text[-1]
Out[4]:
In [5]:
text[2:4]
Out[5]:
In [6]:
text.upper()
Out[6]:
split function of text.Shift+Tab.Shift+Tab twice in quick succession.Shift+Tab four times in quick succession (and then ESC to hide) text with split exactly once (parameter maxsplit) apart by using the l ("L") as separator (parameter sep).
In [7]:
text.split("l",maxsplit=1)
Out[7]:
In [8]:
import pandas as pd
pd?
In [9]:
log = pd.read_csv("../dataset/curse_log.gz")
log.head()
Out[9]:
In [10]:
log.info()
We see that log is
timestamp, curse and profile_idcurse in log with the method value_counts().top_curses.top_curses.Note: In this tutorial, we access Series directly with the .<Series> notation (e. g. log.curse). This works only if the names of the Series are different from the provided functions of a Series. E. g. it doesn't work, when you try to access a Series named count, because count() is a function of a Series. Here, you have to use the ['<Series name>'] notation (e.g. log['count']. When in doubt, always use the ['<Series name>'] notation (but which disables the auto-completion feature)
In [11]:
top_curses = log.curse.value_counts()
top_curses
Out[11]:
In [12]:
%matplotlib inline
top_curses.plot()
Out[12]:
In [13]:
top_curses.plot.bar()
Out[13]:
In [14]:
top_curses.plot.bar();
In [15]:
top_curses.plot.pie();
In [16]:
top_curses.plot(
kind='pie',
title="Top curses",
label="",
figsize=[5,5]);
In [17]:
log.timestamp.head()
Out[17]:
In [18]:
ts = pd.to_datetime(log.timestamp)
ts.head()
Out[18]:
In [19]:
log['timestamp'] = ts
log.head()
Out[19]:
In [20]:
log['hour'] = log.timestamp.dt.hour
log.head()
Out[20]:
In [21]:
curses_per_hour = log.hour.value_counts(sort=False)
curses_per_hour.head()
Out[21]:
In [22]:
curses_per_hour.plot.bar();
Now it's time to find out, which users are cursing the most. We have another data set in the Excel file profiles.xlsx with the following content:
The columns contain this information for all FUTURE 2000 users. It includes the unique identification number of a user's profile (matches profile_id in the curse log) as well as the name, birth date, sex and current job.
We combine this data with our log DataFrame to check off the remaining items on our to-do list:
Find out
In [23]:
profiles = pd.read_excel("../dataset/profiles.xlsx")
profiles.head()
Out[23]:
join() on the log DataFrameprofileson='profile_id' to join log's profile_id column with the id (=index) column of the profiles data set.curse_profiles.curse_profiles.
In [24]:
curse_profiles = log.join(profiles, on='profile_id')
curse_profiles.head()
Out[24]:
In [25]:
curse_profiles.name.value_counts().head(10)
Out[25]:
'M') with the selector notation <DataFrame>[<DataFrame>.<Series> == <value>]sex.sex of the DataFrame curse_profiles.
In [26]:
curse_profiles[curse_profiles.sex == 'M'].sex.count() / curse_profiles.sex.count()
Out[26]:
curse_profiles' data along job and curse by using groupby and the list ['job', 'curse'] as argument.sex (or any other left Series).job_curses.
In [27]:
job_curses = curse_profiles.groupby(['job', 'curse']).sex.count()
job_curses.head(10)
Out[27]:
In [28]:
max_per_job = job_curses.groupby('job').transform('max')
max_per_job.head()
Out[28]:
In [29]:
favorite_curses_per_job = job_curses[job_curses == max_per_job]
favorite_curses_per_job.head()
Out[29]:
In [30]:
favorite_curses_per_job.reset_index().curse.value_counts()
Out[30]:
In [31]:
favorite_curse_words = favorite_curses_per_job.unstack()
favorite_curse_words.head()
Out[31]:
In [32]:
favorite_curse_words[~favorite_curse_words['The S-word'].isnull()]
Out[32]:
You have now learned some basics about pandas. This will get us a long way in our daily work. The other important topics that are still missing are:
mergepivot_table. I hope that this mini-tutorial will show you the potential of data analysis using Jupyter, Python, pandas and matplotlib!
I am looking forward to your comments and feedback!
Markus Harrer
Blog: https://www.feststelltaste.de
Mail: talk@markusharrer.de
Twitter: @feststelltaste
Consulting and training: http://markusharrer.de